Group 2: Phase 2 - Cats vs Dogs Detector (CaDoD)

Team Members

Project Abstract - updated for Phase 2

One of the fundamental tasks in image classification is detecting objects within images, and algorithms commonly do this by predicting a 'bounding box' around each object. To study bounding boxes, our team first evaluated three models with GridSearchCV, training each on the existing bounding box data, and used the best-performing model for bounding box prediction.

For the second phase, we created several PyTorch models that predict both a class label and a bounding box, using Cross Entropy and Mean Squared Error loss functions respectively. PyTorch allows for a much simpler modeling, training, and prediction process, though its need for an Nvidia GPU made a move to Google Colab necessary.

Project Description

The purpose of this project is to create an end-to-end machine learning process: an object classifier and bounding box predictor for cat and dog images. There are about 13,000 RGB images of varying shapes and aspect ratios, with bounding box coordinates stored in a .csv file. To create a detector, we will first preprocess the images so that they all share the same shape, and flatten each one from a 3D array into a row of a 2D feature matrix. We will then feed this matrix into a linear localization predictor and a logistic regression model to predict labels and bounding boxes.
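The resize-and-flatten step can be sketched as follows. This is a minimal illustration using a hand-rolled nearest-neighbour resize in NumPy (the actual pipeline may use PIL or torchvision instead); the 200x200 target size and the `resize_nearest` helper are assumptions for the example.

```python
import numpy as np

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of an (H, W, 3) RGB array (illustrative only)."""
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h   # source row index for each output row
    cols = np.arange(out_w) * w // out_w   # source column index for each output column
    return img[rows][:, cols]

# Two images of different shapes, standardized to 200x200 and flattened to 1D rows
imgs = [np.zeros((300, 400, 3), dtype=np.uint8),
        np.zeros((150, 220, 3), dtype=np.uint8)]
resized = [resize_nearest(im, 200, 200) for im in imgs]
X = np.stack([im.reshape(-1) for im in resized])  # one row per image
print(X.shape)  # (2, 120000)
```

Each 200x200x3 image becomes a 120,000-element row, so the whole dataset fits in one 2D matrix that linear and logistic models can consume directly.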

Articles for PyTorch

https://heartbeat.fritz.ai/basics-of-image-classification-with-pytorch-2f8973c51864

https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html

Data Description - small updates for Phase 2 (connector to gdrive)

The image archive cadod.tar.gz is a subset of Open Images V6. It contains a total of 12,966 images of dogs and cats.

Image bounding boxes are stored in the CSV file cadod.csv. The following describes what is contained inside the CSV.

The attributes have the following definitions:

Identifying columns: ImageID, Source, LabelName, Confidence

Dimensional and positional columns: XMin, XMax, YMin, YMax, XClick1X, XClick2X, XClick3X, XClick4X, XClick1Y, XClick2Y, XClick3Y, XClick4Y

Bounding box and image descriptive columns: IsOccluded, IsTruncated, IsGroupOf, IsDepiction, IsInside

Looking at a few random images, we can see that the photos vary in color, shape, and size. One photo contains both a cat and a dog, with the cat barely visible (bottom row, middle), which shows that any classifier fit on photos like these will have to be robust.

Sample Images

The first step in preparing the data is to standardize the images. Photos will have to be reshaped before modeling so that all images have the same shape and size. One approach would be to load all photos, look at the distribution of photo widths and heights, and then determine a new image size that fits the majority of the images; a smaller size allows a model to train more quickly. Another approach would be to start with a fixed size of 200x200 pixels. We can also filter color images to determine where the majority, or highest density, of each color pixel lies within the image.

The metadata contained in the csv file will need to be matched to each image file, and during Exploratory Data Analysis, we will determine relationships between any of the columns using pandas. For example, how many images contain more than one cat or dog (IsGroupOf)? How many of those images have IsOccluded, IsTruncated, IsInside? Can we determine if the bounding box of one object is larger than the other in order to guess the ‘main’ object? This will drive creation of additional features.
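The EDA questions above map directly onto pandas filters over the flag columns. A small sketch on an invented toy frame (the real queries would run against the cadod.csv metadata; all values here are made up):

```python
import pandas as pd

# Toy metadata frame mirroring the cadod.csv flag columns (values invented)
meta = pd.DataFrame({
    "ImageID":     ["a", "b", "c", "d"],
    "IsGroupOf":   [1, 0, 1, 0],
    "IsOccluded":  [0, 1, 1, 0],
    "IsTruncated": [0, 0, 1, 0],
    "IsInside":    [0, 0, 0, 1],
})

# How many boxes mark a group of animals?
n_groups = int(meta["IsGroupOf"].sum())

# Of those, how many are also occluded, truncated, or inside another object?
groups = meta[meta["IsGroupOf"] == 1]
n_hard = int((groups[["IsOccluded", "IsTruncated", "IsInside"]] == 1).any(axis=1).sum())
print(n_groups, n_hard)  # 2 1
```

The same pattern (boolean mask, then `.any(axis=1)` across flag columns) answers each of the questions posed above with one or two lines.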

The code and project files are stored in a GitHub repository: i526Sp21Group2 (SEE PDF). If needed (depending on the results of EDA), we will impute missing data and document the strategy used. pandas DataFrames embedded in our project Jupyter Notebook will track our exploration and transformation of the data and the engineering of any features ahead of training and fitting. Other Python libraries may be used for visualizations and will be documented.

Import Data - unchanged for Phase 2

Unarchive data

Place the cadod.tar.gz into the same folder as this notebook. We've already extracted the files into the ./data folder (to prevent committing the large gz file to github).
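The unarchive step follows the standard `tarfile` pattern. The sketch below builds a throwaway archive so it is self-contained; in the project, the archive name is cadod.tar.gz and the target folder is ./data.

```python
import pathlib
import tarfile
import tempfile

# Build a tiny throwaway archive to demonstrate the extract step
tmp = pathlib.Path(tempfile.mkdtemp())
(tmp / "img.jpg").write_bytes(b"fake image bytes")

archive = tmp / "demo.tar.gz"
with tarfile.open(archive, "w:gz") as tar:
    tar.add(tmp / "img.jpg", arcname="img.jpg")

# The actual unarchive step: extract everything into a data folder
out = tmp / "data"
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(out)

print(sorted(p.name for p in out.iterdir()))  # ['img.jpg']
```

Extracting once into ./data and adding the archive to .gitignore keeps the large .gz file out of the GitHub repository, as noted above.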

Load bounding box meta data

The metadata in the CSV file is for training for the bounding box prediction only.

Exploratory Data Analysis - unchanged for Phase 2

Statistics

Replace LabelName with human readable labels
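Open Images stores machine IDs (MIDs) in LabelName, so the replacement is a dictionary lookup. A sketch, assuming the two MIDs below are the dog and cat codes used in this subset:

```python
import pandas as pd

# Assumed Open Images MIDs for this subset: dog and cat
label_map = {"/m/0bt9lr": "dog", "/m/01yrx": "cat"}

df = pd.DataFrame({"LabelName": ["/m/0bt9lr", "/m/01yrx", "/m/0bt9lr"]})
df["LabelName"] = df["LabelName"].map(label_map)
print(df["LabelName"].tolist())  # ['dog', 'cat', 'dog']
```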

Sample of Images

By plotting random samples of the images along with the bounding boxes and XClick points, we see that every image has a bounding box but not every image has valid (positive) XClick information. From the descriptions on the CaDoD site, the bounding boxes were either derived from the extreme points clicked (aka XClick) by a human, or provided in some other way (prediction or manually drawn as a box).

Further, the XClick points seem to follow no system: they are not in a predictable clockwise or counterclockwise order, and they do not consistently start on a particular edge (e.g., left side vs. right side).

Since the bounding box information is more widely available and, where XClick is present, the bounding box is derived from it, we can drop the XClick features later on and focus on examining the bounding box attributes.
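Dropping the XClick columns is a one-liner with `DataFrame.filter`, which selects every column whose name contains the given substring. A sketch on a toy frame with a subset of the real columns:

```python
import pandas as pd

df = pd.DataFrame({
    "ImageID": ["a"], "XMin": [0.1], "XMax": [0.9],
    "XClick1X": [0.1], "XClick1Y": [0.2],
})
# filter(like="XClick") selects every XClick* column regardless of suffix
df = df.drop(columns=df.filter(like="XClick").columns)
print(list(df.columns))  # ['ImageID', 'XMin', 'XMax']
```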

Let's look at the IsDepiction column. This seems to indicate if the image is a depiction (drawing, painting, not a 'real' animal photo).

There are 3 items with -1 values. Let's look at those images to see what they're like.

Let's look at the 3 images with IsDepiction = -1

Looking at the dataframe above, these 3 images from Source = "activemil" don't have any XClick or other flag info, though they do have bounding boxes. They are NOT depictions/drawings, so we can look for -1 values to clean up in our Feature Engineering stage.

What else can we tell from this? Let's see if there is any other -1 data to deal with.

Fortunately, this tells us that only the same 3 images have -1 data; the others are fine (using either 0 or 1 for the IsOccluded, IsTruncated, IsGroupOf, and IsInside columns).
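The -1 check generalizes to all flag columns with one boolean mask. A sketch on invented toy data, mimicking the three "activemil" rows found above:

```python
import pandas as pd

flags = ["IsOccluded", "IsTruncated", "IsGroupOf", "IsDepiction", "IsInside"]
df = pd.DataFrame({
    "ImageID":     ["a", "b", "c"],
    "IsOccluded":  [0, -1, 1],
    "IsTruncated": [0, -1, 0],
    "IsGroupOf":   [1, -1, 0],
    "IsDepiction": [0, -1, 0],
    "IsInside":    [0, -1, 0],
})
# Rows where any flag column holds the sentinel value -1
bad = df[(df[flags] == -1).any(axis=1)]
print(bad["ImageID"].tolist())  # ['b']
```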

Image shapes and sizes

Go through all images and record the shape of the image in pixels and the memory size

Count all the different image shapes

There are a ton of different image shapes. Let's narrow this down by summing the counts of any image shape that appears fewer than 100 times and putting that total in a category called other.
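The aggregation can be done directly on the shape counts. A sketch with invented counts (the real Series comes from iterating over the extracted images):

```python
import pandas as pd

# Toy shape counts; real ones come from recording each image's shape
counts = pd.Series({"(375, 500, 3)": 2100, "(333, 500, 3)": 1400,
                    "(500, 375, 3)": 640, "(240, 320, 3)": 40,
                    "(111, 222, 3)": 7})
rare = counts[counts < 100]          # shapes appearing fewer than 100 times
counts = counts[counts >= 100]       # keep only the common shapes
counts["other"] = rare.sum()         # fold the rare shapes into one bucket
print(int(counts["other"]), int(counts.sum()))  # 47 4187
```

The final sum should still equal the total number of images, which is the sanity check performed below.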

Drop the individual image shapes that were folded into the other category

Check if the count sum matches the number of images

Depiction Images

Now, let's look at a random sample of images marked as "Depictions" (IsDepiction = 1). This would indicate a drawing or other nonstandard image representation of a dog or cat and we need to verify this assumption.

Well, IsDepiction doesn't tell us a whole lot. It is set to 1 for paintings, statues, and heavily filtered or artistically distorted photos, but also for photos that are low-light or black and white.

Plot aspect ratio
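Aspect ratio is just width over height of the recorded shapes. A sketch on invented width/height pairs; with matplotlib available, `shapes["aspect"].plot.hist(bins=20)` would produce the plot this cell draws:

```python
import pandas as pd

# Invented width/height pairs standing in for the recorded image shapes
shapes = pd.DataFrame({"width": [500, 500, 320, 640],
                       "height": [375, 333, 240, 480]})
shapes["aspect"] = shapes["width"] / shapes["height"]
print(round(shapes["aspect"].median(), 3))  # 1.333, i.e. most photos near 4:3
```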

PyTorch Implementation
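A minimal sketch of the two-headed setup described in the abstract: one linear backbone feeding a classification head trained with Cross Entropy and a box head trained with Mean Squared Error. Layer sizes and the `SimpleDetector` name are illustrative assumptions, not our actual architecture.

```python
import torch
import torch.nn as nn

class SimpleDetector(nn.Module):
    """Linear backbone with separate class and box heads (illustrative sizes)."""
    def __init__(self, in_features=200 * 200 * 3):
        super().__init__()
        self.backbone = nn.Linear(in_features, 128)
        self.cls_head = nn.Linear(128, 2)   # cat vs. dog logits
        self.box_head = nn.Linear(128, 4)   # XMin, YMin, XMax, YMax

    def forward(self, x):
        h = torch.relu(self.backbone(x))
        return self.cls_head(h), self.box_head(h)

model = SimpleDetector()
x = torch.randn(8, 200 * 200 * 3)           # batch of 8 flattened images
labels = torch.randint(0, 2, (8,))          # random class targets
boxes = torch.rand(8, 4)                    # random box targets

logits, pred_boxes = model(x)
# Combined objective: Cross Entropy for the class, MSE for the box
loss = nn.CrossEntropyLoss()(logits, labels) + nn.MSELoss()(pred_boxes, boxes)
loss.backward()
print(logits.shape, pred_boxes.shape)
```

Summing the two losses lets a single optimizer step train both heads at once; weighting the two terms is a tuning knob we have not explored yet.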

Results / Discussion

We had issues getting the Net model (referred to in our PowerPoint as Conv2d) running.

We implemented a SimpleLinear model and got it working, but its performance was poor (53% accuracy). We need to explore image transformations and other options to improve it.

On the last day of Phase 2, we created a combination model (called FrankenNet) that looks promising with 95% training and 73% testing accuracy on classification (we have yet to add the bounding box prediction class). Please see our secondary notebook submitted (Group2_Phase2_FritzAI_inspired.ipynb) for preliminary work on that model.

We ran into many problems getting Google Colab to work reliably with Google Drive data loading, and unfortunately none of our team members owns a PC with a GPU. There is an issue with the Google Drive connector that sometimes causes 'file not found' errors even though the drive location appears valid.

We developed the SimpleLinear model primarily on Google Colab, but after several training sessions we also ran out of "free" GPU time, so Colab has significant limitations despite the convenience of running on a remote GPU.

Thus far, it seems that PyTorch and tensors can really speed up workflows, but the mental shift required to think of data as tensors, and of features flowing through neural network layers, is a heavy lift with a steep learning curve.

Unfortunately, our SimpleLinearNet doesn't do well with cats (the model outputs a likelihood percentage rather than a firm class guess). The bounding boxes are also not great, but that is to be expected at 53-54% accuracy: a little better than flipping a coin!

Conclusion

We aimed to train a model to predict bounding boxes from the provided images and then classify each image as a dog or cat, using PyTorch and neural networks. Image classification is a complex machine learning problem; focusing on a subset of the data kept this short-term project approachable. Class prediction based on bounding boxes alone does not appear to offer a high probability of success.

We achieved about 53% accuracy with SimpleLinear and intended to refine FrankenNet's transformations to reduce overfitting, but time limitations affected our outcome. Two of our three model training processes function properly; the more promising one still requires more work on bounding box prediction. We hope to refine FrankenNet for our final submission.

Next Steps

Tune the FrankenNet model and image transformations to improve prediction performance and accuracy and to fix the overfitting problem. Perform additional metric analysis once we have more model results to work with (time limitations).